Main¶

Data investigation/cleaning¶

We will be investigating the dataset spambase.csv from the OpenML-100 suite. The dataset concerns e-mails, of which some (~39%) were classified as spam, whereas the rest were work and personal e-mails. After dropping the first column (a plain index) and verifying that there are no NaN cells, we are left with a 4601x56 dataframe.

According to the dataset documentation (https://www.openml.org/search?type=data&sort=runs&id=44&status=active), the columns of the dataframe are the following:

  • 48 continuous real [0,100] attributes of type word_freq_WORD = percentage of words in the e-mail that match WORD, i.e. 100 * (number of times the WORD appears in the e-mail) / total number of words in e-mail. A "word" in this case is any string of alphanumeric characters bounded by non-alphanumeric characters or end-of-string.
  • 6 continuous real [0,100] attributes of type char_freq_CHAR = percentage of characters in the e-mail that match CHAR, i.e. 100 * (number of CHAR occurrences) / total number of characters in e-mail
  • 1 continuous real [1,...] attribute of type capital_run_length_average = average length of uninterrupted sequences of capital letters
  • 1 nominal {0,1} class attribute of type spam = denotes whether the e-mail was considered spam (1) or not (0), i.e. unsolicited commercial e-mail.
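
As a concrete illustration of the word_freq definition above, here is a hypothetical sketch (not part of the original pipeline) that computes the frequency for one word:

```python
import re

def word_freq(text: str, word: str) -> float:
    # A "word" is a maximal run of alphanumeric characters, per the dataset docs.
    words = re.findall(r"[0-9A-Za-z]+", text.lower())
    # word_freq_WORD = 100 * (occurrences of WORD) / (total number of words)
    return 100.0 * words.count(word.lower()) / len(words)

# "free" appears 2 times out of 7 words -> 100 * 2 / 7
print(word_freq("Free money! Claim your free prize now", "free"))
```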

Each word frequency is low on average (<1% of all words), but in some e-mails a single word can make up 40-50% of all the words; the same holds for the specified characters. The mean capital-run length across e-mails was ~5 characters, while the median was ~2 and the maximum was >1100, indicating a strong skew to the right.
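
A mean well above the median is the classic signature of right skew. A minimal sketch on synthetic data (not the actual spambase column) showing the same pattern:

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed run lengths: many short runs, a few very long ones.
rng = np.random.default_rng(0)
runs = pd.Series(rng.lognormal(mean=0.7, sigma=1.0, size=10_000))

# For a right-skewed distribution the mean sits above the median,
# pulled up by the long tail of extreme values.
print("mean:", runs.mean(), "median:", runs.median(), "max:", runs.max())
```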

Models training¶

After the initial investigation and cleaning of the data, we proceeded to train models that predict whether an e-mail is spam. We train 3 models:

  • Simple Logistic Regression (no hyperparameters)
  • Random Forest (a few hyperparameters, which we will optimize)
  • TabPFN* - used straight from its library with no fine-tuning.

We divided our dataset into 90% train and 10% eval subsets. During training we ran 5-fold cross-validation on the training set and then evaluated final performance on the eval set. Cross-validation mitigates the potentially high variance of a single train-test split, which could skew the results; the cross-validation train and test results shown later are the train and test scores averaged over the 5 folds. The completely separate eval set matters for the models whose hyperparameters we tuned: once hyperparameters are selected via cross-validation, we run the final model once on the eval set to get an unbiased performance estimate.
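
The evaluation protocol above can be sketched end-to-end on synthetic data (the names X_demo/y_demo are illustrative, not the spambase variables):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_validate, train_test_split

X_demo, y_demo = make_classification(n_samples=400, n_features=10, random_state=2)

# Step 1: hold out 10% as a completely separate eval set.
X_tr, X_ev, y_tr, y_ev = train_test_split(X_demo, y_demo, test_size=0.1, random_state=2)

# Step 2: 5-fold cross-validation on the training part only.
scores = cross_validate(LogisticRegression(max_iter=1000), X_tr, y_tr,
                        cv=KFold(n_splits=5), scoring='accuracy',
                        return_train_score=True)
print("CV train:", np.mean(scores['train_score']),
      "CV test:", np.mean(scores['test_score']))
```

The eval split (X_ev, y_ev) is touched only once, after model selection, so its accuracy is not inflated by the tuning process.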

*Due to complexity limitations of TabPFN, we trained it on only 1000 randomly chosen rows of the input data (larger amounts of data raised an error).

Models results¶

Below we present a table with the results of each model. Since this is a binary (0-1) classification task, we report accuracy. For the RF model we show the best configuration found via cross-validation. The CV train and CV test columns show accuracy averaged over the 5 cross-validation folds.

                      CV train accuracy   CV test accuracy   Eval accuracy
Logistic Regression   0.927               0.922              0.941
Random Forest*        0.958               0.936              0.961
TabPFN**              0.987               0.929              0.967

*Hyperparameters selected (n_estimators, max_depth, max_features): (200, 8, 0.3)
**Trained on only 1000 rows of the dataframe

Comments¶

  1. Logistic regression was the simplest to train. The CV results show very little overfitting (expected, given the very low number of parameters in a logistic regression). The final eval accuracy was still high at ~94%.
  2. For the random forest we optimized the hyperparameters. Some overfitting was present, as indicated by the gap between the CV train accuracy (~96%) and CV test accuracy (~94%). The final eval accuracy of the RF, at ~96%, was better than that of logistic regression.
  3. TabPFN was trained on only about a quarter of the dataset due to complexity limitations. It achieved a very high CV train accuracy (~99%) with a lower CV test accuracy (~93%), showing more severe overfitting. Nevertheless, its final eval accuracy was the highest of all the models at ~97%, despite training on the least data.

Appendix¶

Data investigation¶

In [2]:
import numpy as np
import pandas as pd

spambase = pd.read_csv("spambase.csv")
In [3]:
spambase.head()
Out[3]:
Unnamed: 0 word_freq_make word_freq_address word_freq_all word_freq_3d word_freq_our word_freq_over word_freq_remove word_freq_internet word_freq_order ... word_freq_table word_freq_conference char_freq_%3B char_freq_%28 char_freq_%5B char_freq_%21 char_freq_%24 char_freq_%23 capital_run_length_average TARGET
0 0 0.00 0.64 0.64 0.0 0.32 0.00 0.00 0.00 0.00 ... 0.0 0.0 0.00 0.000 0.0 0.778 0.000 0.000 3.756 1
1 1 0.21 0.28 0.50 0.0 0.14 0.28 0.21 0.07 0.00 ... 0.0 0.0 0.00 0.132 0.0 0.372 0.180 0.048 5.114 1
2 2 0.06 0.00 0.71 0.0 1.23 0.19 0.19 0.12 0.64 ... 0.0 0.0 0.01 0.143 0.0 0.276 0.184 0.010 9.821 1
3 3 0.00 0.00 0.00 0.0 0.63 0.00 0.31 0.63 0.31 ... 0.0 0.0 0.00 0.137 0.0 0.137 0.000 0.000 3.537 1
4 4 0.00 0.00 0.00 0.0 0.63 0.00 0.31 0.63 0.31 ... 0.0 0.0 0.00 0.135 0.0 0.135 0.000 0.000 3.537 1

5 rows × 57 columns

In [4]:
spambase.isna().sum().sum()
Out[4]:
0
In [5]:
df = spambase.drop(spambase.columns[0], axis=1)  # Drop the first column, which is just an index
df
Out[5]:
word_freq_make word_freq_address word_freq_all word_freq_3d word_freq_our word_freq_over word_freq_remove word_freq_internet word_freq_order word_freq_mail ... word_freq_table word_freq_conference char_freq_%3B char_freq_%28 char_freq_%5B char_freq_%21 char_freq_%24 char_freq_%23 capital_run_length_average TARGET
0 0.00 0.64 0.64 0.0 0.32 0.00 0.00 0.00 0.00 0.00 ... 0.0 0.0 0.000 0.000 0.0 0.778 0.000 0.000 3.756 1
1 0.21 0.28 0.50 0.0 0.14 0.28 0.21 0.07 0.00 0.94 ... 0.0 0.0 0.000 0.132 0.0 0.372 0.180 0.048 5.114 1
2 0.06 0.00 0.71 0.0 1.23 0.19 0.19 0.12 0.64 0.25 ... 0.0 0.0 0.010 0.143 0.0 0.276 0.184 0.010 9.821 1
3 0.00 0.00 0.00 0.0 0.63 0.00 0.31 0.63 0.31 0.63 ... 0.0 0.0 0.000 0.137 0.0 0.137 0.000 0.000 3.537 1
4 0.00 0.00 0.00 0.0 0.63 0.00 0.31 0.63 0.31 0.63 ... 0.0 0.0 0.000 0.135 0.0 0.135 0.000 0.000 3.537 1
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
4596 0.31 0.00 0.62 0.0 0.00 0.31 0.00 0.00 0.00 0.00 ... 0.0 0.0 0.000 0.232 0.0 0.000 0.000 0.000 1.142 0
4597 0.00 0.00 0.00 0.0 0.00 0.00 0.00 0.00 0.00 0.00 ... 0.0 0.0 0.000 0.000 0.0 0.353 0.000 0.000 1.555 0
4598 0.30 0.00 0.30 0.0 0.00 0.00 0.00 0.00 0.00 0.00 ... 0.0 0.0 0.102 0.718 0.0 0.000 0.000 0.000 1.404 0
4599 0.96 0.00 0.00 0.0 0.32 0.00 0.00 0.00 0.00 0.00 ... 0.0 0.0 0.000 0.057 0.0 0.000 0.000 0.000 1.147 0
4600 0.00 0.00 0.65 0.0 0.00 0.00 0.00 0.00 0.00 0.00 ... 0.0 0.0 0.000 0.000 0.0 0.125 0.000 0.000 1.250 0

4601 rows × 56 columns

In [6]:
df.describe()
Out[6]:
word_freq_make word_freq_address word_freq_all word_freq_3d word_freq_our word_freq_over word_freq_remove word_freq_internet word_freq_order word_freq_mail ... word_freq_table word_freq_conference char_freq_%3B char_freq_%28 char_freq_%5B char_freq_%21 char_freq_%24 char_freq_%23 capital_run_length_average TARGET
count 4601.000000 4601.000000 4601.000000 4601.000000 4601.000000 4601.000000 4601.000000 4601.000000 4601.000000 4601.000000 ... 4601.000000 4601.000000 4601.000000 4601.000000 4601.000000 4601.000000 4601.000000 4601.000000 4601.000000 4601.000000
mean 0.104553 0.213015 0.280656 0.065425 0.312223 0.095901 0.114208 0.105295 0.090067 0.239413 ... 0.005444 0.031869 0.038575 0.139030 0.016976 0.269071 0.075811 0.044238 5.191515 0.394045
std 0.305358 1.290575 0.504143 1.395151 0.672513 0.273824 0.391441 0.401071 0.278616 0.644755 ... 0.076274 0.285735 0.243471 0.270355 0.109394 0.815672 0.245882 0.429342 31.729449 0.488698
min 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 0.000000
25% 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.588000 0.000000
50% 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.065000 0.000000 0.000000 0.000000 0.000000 2.276000 0.000000
75% 0.000000 0.000000 0.420000 0.000000 0.380000 0.000000 0.000000 0.000000 0.000000 0.160000 ... 0.000000 0.000000 0.000000 0.188000 0.000000 0.315000 0.052000 0.000000 3.706000 1.000000
max 4.540000 14.280000 5.100000 42.810000 10.000000 5.880000 7.270000 11.110000 5.260000 18.180000 ... 2.170000 10.000000 4.385000 9.752000 4.081000 32.478000 6.003000 19.829000 1102.500000 1.000000

8 rows × 56 columns

In [7]:
X = df.loc[:, df.columns != 'TARGET']
X.head()
Out[7]:
word_freq_make word_freq_address word_freq_all word_freq_3d word_freq_our word_freq_over word_freq_remove word_freq_internet word_freq_order word_freq_mail ... word_freq_edu word_freq_table word_freq_conference char_freq_%3B char_freq_%28 char_freq_%5B char_freq_%21 char_freq_%24 char_freq_%23 capital_run_length_average
0 0.00 0.64 0.64 0.0 0.32 0.00 0.00 0.00 0.00 0.00 ... 0.00 0.0 0.0 0.00 0.000 0.0 0.778 0.000 0.000 3.756
1 0.21 0.28 0.50 0.0 0.14 0.28 0.21 0.07 0.00 0.94 ... 0.00 0.0 0.0 0.00 0.132 0.0 0.372 0.180 0.048 5.114
2 0.06 0.00 0.71 0.0 1.23 0.19 0.19 0.12 0.64 0.25 ... 0.06 0.0 0.0 0.01 0.143 0.0 0.276 0.184 0.010 9.821
3 0.00 0.00 0.00 0.0 0.63 0.00 0.31 0.63 0.31 0.63 ... 0.00 0.0 0.0 0.00 0.137 0.0 0.137 0.000 0.000 3.537
4 0.00 0.00 0.00 0.0 0.63 0.00 0.31 0.63 0.31 0.63 ... 0.00 0.0 0.0 0.00 0.135 0.0 0.135 0.000 0.000 3.537

5 rows × 55 columns

In [8]:
y = df.loc[:, df.columns == 'TARGET']
y.head()
Out[8]:
TARGET
0 1
1 1
2 1
3 1
4 1
In [9]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=2)

Logistic regression:¶

In [10]:
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_validate
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
kf = KFold(n_splits = 5)
In [12]:
clf = LogisticRegression(random_state=2)
clf_scores = cross_validate(clf, X_train, y_train, cv=kf, n_jobs=-1, scoring='accuracy', return_train_score=True)
print("Accuracy: Train: ", np.mean(np.array(clf_scores['train_score'])), " Test: ", np.mean(np.array(clf_scores['test_score'])))
Accuracy: Train:  0.9268115942028985  Test:  0.9219806763285024
In [13]:
clf_final = LogisticRegression(random_state=2).fit(X_train, y_train)
print("Eval accuracy: ", accuracy_score(y_test, clf_final.predict(X_test)))
Eval accuracy:  0.9414316702819957
C:\Users\Antek\anaconda3\lib\site-packages\sklearn\utils\validation.py:1111: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
  y = column_or_1d(y, warn=True)
C:\Users\Antek\anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py:444: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
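
Both warnings above are fixable: the DataConversionWarning by flattening y with ravel(), and the ConvergenceWarning by scaling the features (and/or raising max_iter). A sketch of the fix on synthetic data, not the run whose numbers we report:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = make_classification(n_samples=500, n_features=20, random_state=2)

# StandardScaler helps lbfgs converge; ravel() turns a column vector
# into the 1-d array sklearn expects.
clf = make_pipeline(StandardScaler(),
                    LogisticRegression(max_iter=1000, random_state=2))
clf.fit(X_demo, np.asarray(y_demo).ravel())
print("Train accuracy:", accuracy_score(y_demo, clf.predict(X_demo)))
```

Note that scaling could slightly change the reported coefficients, but for this dataset the accuracy was already computed despite the warning, so the figures in the table stand.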

Random Forest¶

In [14]:
from sklearn.ensemble import RandomForestClassifier


modelsRF = []
for n_estim in [50, 100, 200]:
  for max_dep in [2, 5, 8]:
    for max_feat in [0.1, 0.3, 0.5, 0.8]:
      print("Training model with (n_estimators, max_depth, max_features): ", (n_estim, max_dep, max_feat))
      curr_regr = RandomForestClassifier(n_estimators=n_estim, max_depth = max_dep, max_features = max_feat, random_state = 1)
      curr_scores = cross_validate(curr_regr, X_train, y_train, cv=kf, n_jobs=-1, scoring='accuracy', return_train_score=True)
      modelsRF.append((curr_regr, n_estim, max_dep, max_feat, curr_scores))
      print("Accuracy: Train: ", np.mean(np.array(curr_scores['train_score'])), " Test: ", np.mean(np.array(curr_scores['test_score'])))
Training model with (n_estimators, max_depth, max_features):  (50, 2, 0.1)
Accuracy: Train:  0.8749396135265701  Test:  0.870048309178744
Training model with (n_estimators, max_depth, max_features):  (50, 2, 0.3)
Accuracy: Train:  0.8948671497584542  Test:  0.8896135265700483
Training model with (n_estimators, max_depth, max_features):  (50, 2, 0.5)
Accuracy: Train:  0.8936594202898551  Test:  0.8874396135265702
Training model with (n_estimators, max_depth, max_features):  (50, 2, 0.8)
Accuracy: Train:  0.879951690821256  Test:  0.8739130434782609
Training model with (n_estimators, max_depth, max_features):  (50, 5, 0.1)
Accuracy: Train:  0.9237318840579711  Test:  0.9171497584541063
Training model with (n_estimators, max_depth, max_features):  (50, 5, 0.3)
Accuracy: Train:  0.9337560386473431  Test:  0.923913043478261
Training model with (n_estimators, max_depth, max_features):  (50, 5, 0.5)
Accuracy: Train:  0.933695652173913  Test:  0.9214975845410628
Training model with (n_estimators, max_depth, max_features):  (50, 5, 0.8)
Accuracy: Train:  0.9327294685990338  Test:  0.9193236714975844
Training model with (n_estimators, max_depth, max_features):  (50, 8, 0.1)
Accuracy: Train:  0.9503623188405796  Test:  0.9301932367149759
Training model with (n_estimators, max_depth, max_features):  (50, 8, 0.3)
Accuracy: Train:  0.9569444444444445  Test:  0.932608695652174
Training model with (n_estimators, max_depth, max_features):  (50, 8, 0.5)
Accuracy: Train:  0.957669082125604  Test:  0.9304347826086957
Training model with (n_estimators, max_depth, max_features):  (50, 8, 0.8)
Accuracy: Train:  0.9573067632850242  Test:  0.9277777777777778
Training model with (n_estimators, max_depth, max_features):  (100, 2, 0.1)
Accuracy: Train:  0.8795893719806763  Test:  0.8760869565217393
Training model with (n_estimators, max_depth, max_features):  (100, 2, 0.3)
Accuracy: Train:  0.8993961352657006  Test:  0.8946859903381643
Training model with (n_estimators, max_depth, max_features):  (100, 2, 0.5)
Accuracy: Train:  0.8975845410628018  Test:  0.8905797101449275
Training model with (n_estimators, max_depth, max_features):  (100, 2, 0.8)
Accuracy: Train:  0.8850845410628019  Test:  0.8789855072463768
Training model with (n_estimators, max_depth, max_features):  (100, 5, 0.1)
Accuracy: Train:  0.9263888888888889  Test:  0.9169082125603865
Training model with (n_estimators, max_depth, max_features):  (100, 5, 0.3)
Accuracy: Train:  0.9347826086956521  Test:  0.9234299516908212
Training model with (n_estimators, max_depth, max_features):  (100, 5, 0.5)
Accuracy: Train:  0.932548309178744  Test:  0.9229468599033815
Training model with (n_estimators, max_depth, max_features):  (100, 5, 0.8)
Accuracy: Train:  0.932487922705314  Test:  0.920048309178744
Training model with (n_estimators, max_depth, max_features):  (100, 8, 0.1)
Accuracy: Train:  0.9515700483091788  Test:  0.9318840579710145
Training model with (n_estimators, max_depth, max_features):  (100, 8, 0.3)
Accuracy: Train:  0.9583333333333333  Test:  0.9340579710144926
Training model with (n_estimators, max_depth, max_features):  (100, 8, 0.5)
Accuracy: Train:  0.9585144927536232  Test:  0.9318840579710145
Training model with (n_estimators, max_depth, max_features):  (100, 8, 0.8)
Accuracy: Train:  0.9582125603864735  Test:  0.9292270531400966
Training model with (n_estimators, max_depth, max_features):  (200, 2, 0.1)
Accuracy: Train:  0.8794082125603865  Test:  0.8768115942028987
Training model with (n_estimators, max_depth, max_features):  (200, 2, 0.3)
Accuracy: Train:  0.8977657004830919  Test:  0.8939613526570047
Training model with (n_estimators, max_depth, max_features):  (200, 2, 0.5)
Accuracy: Train:  0.9026570048309178  Test:  0.8958937198067634
Training model with (n_estimators, max_depth, max_features):  (200, 2, 0.8)
Accuracy: Train:  0.883816425120773  Test:  0.8801932367149758
Training model with (n_estimators, max_depth, max_features):  (200, 5, 0.1)
Accuracy: Train:  0.9255434782608696  Test:  0.917391304347826
Training model with (n_estimators, max_depth, max_features):  (200, 5, 0.3)
Accuracy: Train:  0.9349033816425122  Test:  0.9251207729468598
Training model with (n_estimators, max_depth, max_features):  (200, 5, 0.5)
Accuracy: Train:  0.9331521739130435  Test:  0.9219806763285024
Training model with (n_estimators, max_depth, max_features):  (200, 5, 0.8)
Accuracy: Train:  0.9329106280193237  Test:  0.9205314009661836
Training model with (n_estimators, max_depth, max_features):  (200, 8, 0.1)
Accuracy: Train:  0.9518719806763285  Test:  0.9314009661835747
Training model with (n_estimators, max_depth, max_features):  (200, 8, 0.3)
Accuracy: Train:  0.9581521739130435  Test:  0.9355072463768115
Training model with (n_estimators, max_depth, max_features):  (200, 8, 0.5)
Accuracy: Train:  0.9588164251207729  Test:  0.9316425120772948
Training model with (n_estimators, max_depth, max_features):  (200, 8, 0.8)
Accuracy: Train:  0.9589371980676329  Test:  0.9323671497584541
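
The best configuration can also be picked programmatically from the collected scores rather than by eye. A sketch over a mock list mirroring the modelsRF tuple layout, with made-up accuracy values:

```python
import numpy as np

# Each entry mirrors the notebook's (model, n_estimators, max_depth,
# max_features, scores) tuples; values here are hypothetical.
mock_models = [
    (None, 100, 8, 0.3, {'test_score': np.array([0.93, 0.94, 0.93, 0.94, 0.93])}),
    (None, 200, 8, 0.3, {'test_score': np.array([0.94, 0.93, 0.94, 0.94, 0.93])}),
    (None, 200, 8, 0.5, {'test_score': np.array([0.93, 0.93, 0.93, 0.94, 0.93])}),
]

# Select the entry with the highest mean CV test accuracy.
best = max(mock_models, key=lambda m: np.mean(m[4]['test_score']))
print("Best (n_estimators, max_depth, max_features):", best[1:4])
```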
In [ ]:
# DataFrame.append is deprecated (removed in pandas 2.0);
# collect plain rows and build the frame once instead.
cols = ['n_estimators', 'max_depth', 'max_features', 'score_type', 'Accuracy']
rows = []
for m in modelsRF:
  _, n_estim, max_dep, max_feat, scores = m
  for sc_type in ['test_score', 'train_score']:
    for k in range(5):
      rows.append([n_estim, max_dep, max_feat, sc_type, scores[sc_type][k]])
results_chart_RF = pd.DataFrame(rows, columns = cols)
In [16]:
import plotly.express as px
In [17]:
px.box(results_chart_RF, y='Accuracy', color='score_type', facet_col='max_features', x='max_depth', facet_row='n_estimators')
In [18]:
RF_final = RandomForestClassifier(n_estimators=200, max_depth = 8, max_features = 0.3, random_state = 1).fit(X_train, y_train)
print("Eval accuracy: ", accuracy_score(y_test, RF_final.predict(X_test)))
C:\Users\Antek\AppData\Local\Temp\ipykernel_10616\4069508598.py:1: DataConversionWarning:

A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().

Eval accuracy:  0.9609544468546638

TabPFN¶

In [19]:
from tabpfn import TabPFNClassifier
In [20]:
tabpfn = TabPFNClassifier(device='cpu', N_ensemble_configurations=32)
tabpfn_scores = cross_validate(tabpfn, X_train[:1000], y_train[:1000], cv=kf, n_jobs=-1, scoring='accuracy', return_train_score=True)
print("Accuracy: Train: ", np.mean(np.array(tabpfn_scores['train_score'])), " Test: ", np.mean(np.array(tabpfn_scores['test_score'])))
Loading model that can be used for inference only
Using a Transformer with 25.82 M parameters
Accuracy: Train:  0.98675  Test:  0.9289999999999999
In [21]:
tabpfn_final = TabPFNClassifier(device='cpu', N_ensemble_configurations=32).fit(X_train[:1000], y_train[:1000])
print("Eval accuracy: ", accuracy_score(y_test, tabpfn_final.predict(X_test)))
Loading model that can be used for inference only
Using a Transformer with 25.82 M parameters
C:\Users\Antek\anaconda3\lib\site-packages\sklearn\utils\validation.py:1111: DataConversionWarning:

A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().

Eval accuracy:  0.9674620390455532